Abstract

The modern abundance and prominence of data have led to the development of “data 
science” as a new field of enquiry, along with a body of epistemological reflections 
upon its foundations, methods, and consequences. This article provides a systematic 
analysis and critical review of significant open problems and debates in the epistemol- 
ogy of data science. We propose a partition of the epistemology of data science into the 
following five domains: (i) the constitution of data science; (ii) the kind of enquiry that 
it identifies; (iii) the kinds of knowledge that data science generates; (iv) the nature 
and epistemological significance of “black box” problems; and (v) the relationship 
between data science and the philosophy of science more generally. 


Keywords Black box · Data-driven science · Data science · Epistemology ·
Foundationalism · Philosophy of science


1 Introduction 


Data science has become a mature field of enquiry only recently, propelled by the 
proliferation of data and computing infrastructure. While many have written about the 
philosophical problems in data science, such problems are rarely unified into a holistic
“epistemology of data science” (we avoid the more generic expression “philosophy of 
data science”—more on this presently). In its current state, this epistemology is vibrant 
but chaotic. For this reason, in this article, we review the relevant literature to provide 
a unified perspective of the discipline and its gaps; assess the state of the debate; offer 
a contextual analysis of the significance, relevance, and value of various topics; and 
identify neglected or underexplored areas of philosophical interest. We do not discuss 
data science’s GELSI (governance, ethical, legal, and social implications). These have 
already received considerable attention, and their analysis would lie beyond the scope
of the present work, even if, ultimately, we shall point to obvious connections. We 
also limit our review to the epistemological debate concerning data science, without 
entering the related, significant, but distinct debates about the epistemology of data 
and the role of data in epistemology (including philosophy of science and scientific 
research). It seems clear that the epistemology and ethics of data science (in the 
inclusive sense of GELSI indicated above) may need to find a unified framework. 
Still, this article would be the wrong context to attempt such a unification. 

Methodologically, we determined the scope of this epistemological analysis by 
a structured literature search in the sense indicated by Grant and Booth (2009) and 
detailed in the “Appendix”. Its content was therefore determined empirically. This 
empirically-driven approach has a twofold motivation. First, there is now an area of 
enquiry explicitly labelled “data science”, which is something more than simply a 
general study of data, or a science which investigates “data” (indeed, almost anything 
might be understood as data science in one way or another, with such a broad char-
acterisation). Whatever “data science” is (see Sects. 2 and 3), it has an associated
epistemology, and fundamental philosophical issues have only recently begun to be 
explored, at least in a critical fashion, as a philosophy of data science. Second, since, 
as we noted above, this philosophy is still relatively fragmented, an empirically-driven 
approach to what is already available in terms of current debates makes possible an 
overview of this emerging discipline, returning at least a first image of its landscape 
and contours. 

Our findings partition the epistemology of data science into five areas, and the article 
is structured accordingly. We present them in a logical rather than chronological order. 
We begin in Sect. 2 with the characterisation of data science (what it is or should
be), covering descriptive and normative accounts of this new discipline,
including a reference to what data scientists do and should do. In Sect. 3, we analyse
the debate about what kind of enquiry data science is. Next, in Sect. 4, we analyse 
the related debate about the nature and genealogy of the knowledge that data science 
produces. In Sect. 5, we concentrate on one of the most significant methodological 
issues in data science, the so-called “black box” problems, such as interpretability 
and explainability. Still on the methodological side, in Sect. 6, we outline the various 
explorations of the epistemically revolutionary new frontier raised by data science: 
the so-called “theory-free” paradigm in scientific methodology. In Sect. 7, we briefly 
summarise our analysis. 


2 The characterisation of data science


This section reviews the most relevant definitions of data science proposed in the 
literature, spanning descriptive and normative alternatives. It concludes by offering a 
proposal that synthesises the most valuable elements of each. 


2.1 Minimalist and maximalist characterisations 


At the dawn of the twentieth century, statistics came to be recognised as an aca- 
demic discipline worthy of its own journals and university departments. Technological 
advances in subsequent decades marked a definite break from theory-driven and infer- 
ential classical statistics. New approaches, such as bootstrapping and Markov chain 
Monte Carlo simulations, replaced strong parametric assumptions with brute compu- 
tational power. Viewed from this perspective, machine learning algorithms—which 
automatically detect and exploit subtle patterns in large datasets—are simply the next 
logical step in a centuries-long progression toward ever more automated forms of 
empirical reasoning. 
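
As a concrete illustration of trading parametric assumptions for computation, consider a minimal sketch of the percentile bootstrap. The sketch is ours and is not drawn from the works cited; the measurements, the function name bootstrap_ci, and the choice of 2000 resamples are assumptions made purely for illustration. No distributional form is assumed for the data: the uncertainty of the median is estimated by repeatedly resampling the observations with replacement.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.median, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Read the interval off the empirical quantiles of those estimates.
    lower = estimates[round((alpha / 2) * n_resamples)]
    upper = estimates[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, including one outlier.
data = [2.1, 2.4, 2.5, 2.9, 3.0, 3.3, 3.8, 4.1, 4.4, 9.7]
low, high = bootstrap_ci(data)
print(f"sample median: {statistics.median(data)}")
print(f"95% bootstrap interval: [{low}, {high}]")
```

A classical treatment would typically begin from an assumed sampling distribution; here the interval is obtained by resampling alone, at the cost of a few thousand recomputations of the statistic.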

The question of when precisely these early forays into quantitative modes of analy- 
sis crystallised into what we now call “data science” presupposes that the discipline has 
some yet unspecified essential character. Although we are sceptical of any purported 
“solution” to the so-called “problem” of demarcation—in this area, as in science more 
generally—we observe two broad trends in the literature on this topic, which we shall 
label the “minimalist” and “maximalist” accounts (more on this below). As we shall
see, minimalists aim for necessary conditions, as weakly constraining as possible but 
still carving out a unique space for data science. Maximalists strive for sufficient condi- 
tions with detailed ontologies and methodological taxonomies. Minimalist approaches 
characterise early debates on the nature of data science. Contemporary analyses tend 
to embrace maximalist approaches, identifying in data science a means to develop 
causal knowledge directly connected to the object of investigation. 

Minimalist conceptions do not commit data science to any method or subject(s), 
and do not make any specific claims about what kind of discipline data science is. 
They focus only on the pedagogical aspects and their dependency on information and 
data. Chambers (1993) and Carmichael and Marron (2018) provide two examples 
of minimalist accounts. Chambers (1993) presents a “greater statistics” view of data 
science, characterised as “everything related to learning from data” (Chambers, 1993, 
p. 182, italics in the original). Similarly, Carmichael and Marron (2018, p. 117) claim 
that data science is “the business of learning from data” and that a data scientist is 
someone who “uses data to solve problems”. 

Maximalist accounts are more fine-grained and qualify the discipline’s relation to 
data in general. Breiman (2001) gives a maximalist account with a single qualification, 
characterising data science as a subject interested in two broad classes, “prediction” 
and “information”. Prediction is “to be able to predict what the responses are going to 
be to future input variables”, while information is “to extract some information about 
how nature is associating the response variables to the input variables”. According to 
Breiman, statisticians (taken to include data scientists) may be interested in making 
correlative predictions from data and also extracting information about any associated
underlying natural causal mechanisms. Therefore, he makes an epistemic distinction 
between correlative/predictive knowledge on the one hand and causal knowledge on 
the other. That Breiman’s vision of data science is concerned with these two products 
means that his account is narrower than the minimalist accounts above. 
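
Breiman's two goals can be illustrated with a small sketch, which is our own and rests on hypothetical data simulated from a known linear relationship. A fitted line serves both goals: its slope is "information" about how the response is associated with the input, and it also yields predictions for future inputs. A nearest-neighbour average, by contrast, predicts without offering any parameter to interpret. The names fit_line and knn_predict are illustrative and are not taken from Breiman.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for a single input; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_predict(xs, ys, x_new, k=5):
    """Predict by averaging the responses of the k inputs closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical data: y = 2x + 1 plus Gaussian noise (assumed for illustration).
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.3) for x in xs]

# "Information": the estimated parameters describe the association itself.
slope, intercept = fit_line(xs, ys)

# "Prediction": the response expected for an input not yet observed.
line_prediction = slope * 6.0 + intercept

# Prediction without any interpretable parameter.
knn_prediction = knn_predict(xs, ys, 2.0)

print(f"estimated slope and intercept: {slope:.2f}, {intercept:.2f}")
print(f"line prediction at x = 6.0: {line_prediction:.2f}")
print(f"nearest-neighbour prediction at x = 2.0: {knn_prediction:.2f}")
```

Note that the estimated slope is correlative unless further assumptions are made; reading it as a causal mechanism requires more than the fit itself supplies.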

Another maximalist account is provided by Mallows (2006, p. 322), who gives 
statistics an essentially practical nature. As he writes, “Statistics concerns the rela- 
tion of quantitative data to a real-world problem, often in the presence of variability 
and uncertainty. It attempts to make precise and explicit what the data has to say 
about the problem of interest.” Mallows emphasises the primacy of problem-solving
and of applying data to the “real world” over general and more abstract
intellectual enquiry. Thus, his vision of statistics is somewhat removed from a purely 
mathematical and formal conception [this point is also stressed by Blei and Smyth 
(2017), discussed below]. A unique aspect of Mallows’ account is his explicit men- 
tion of variability and uncertainty, which data-scientific and statistical methods must 
confront. This embodies an implicit commitment to the separation of the noisy real 
world and the idealised constructs familiar to the natural and social sciences. This 
separation is essential. It means statistics is characterised as a fundamental epistemic 
method concerned with how human beings relate to the world around them. Statistics 
forms a kind of epistemic bridge between the two worlds. 

Donoho (2017, p. 746) also supports a maximalist approach. His account of data 
science has a distinctively sociological dimension, referencing the Data Science 
Association’s “professional code of conduct”: “‘Data Scientist’ means a professional 
who uses scientific methods to liberate and create meaning from raw data [our 
italics].” The words “liberate” and “create” indicate that Donoho’s account is con- 
sistent with two broad philosophies of science: realist and antirealist leanings. That 
data science can liberate meaning assumes a realist position, according to which we 
uncover a particular objective and independent ontology through scientific enquiry. 
It suggests that data originates from phenomena and processes that are inherently 
amenable to a systematic study and comprehension. However, the term “create” 
implies a more antirealist conception, according to which we superimpose artificial, 
context-related ontologies upon data as means to our particular ends. The extent to 
which data science creates or liberates meaning will depend on one’s position in 
such a debate. Conceptions of data science, like Donoho’s, seek to accommodate 
both. However, Donoho’s account remains problematic because of its unqualified 
and underdeveloped reliance on the concept of “raw data”. Strictly speaking, data are 
never entirely devoid of interpretation, so there is no such thing as “raw data”. Data 
are always loosely interpreted, at least because they have been selected instead of 
other data, and because they are framed by paradigms, hypotheses, theoretical needs, 
background assumptions, and so forth. As Donoho writes from within the era of big 
data, his assumption that “raw data” is a suitable base from which to liberate and 
create meaning is a consequence of the contemporary attitude that data can be, are, and
will be recorded in sufficient depth, breadth, and quality for any problem domain.
However, we shall see in Sect. 6 that this assumption is questionable. 

Leonelli (2020, Sect. 1) offers an account of data science in her attempt at an initial 
definition of Big Data: “Perhaps the most straightforward characterisation is as large 
datasets that are produced in a digital form and can be analysed through computational
tools”. Here, she draws attention to the computational methodology characteristic of 
modern data-scientific investigation, and underscores the close relationship between 
data science (a mode of inquiry) and Big Data (a class of information). Blei and Smyth 
(2017, p. 8691) agree with Leonelli’s emphasis on computational methodology, but fur- 
ther qualify it with a practical component: “data science blends statistical and compu- 
tational thinking... It connects statistical models and computational methods to solve 
discipline-specific problems.” They prioritise statistical and computational methods, 
thus emphasising a practical rather than pedagogical priority. However, this charac- 
terisation does not specify information—broadly conceived—as data science’s object 
of interest, nor does it mark specific disciplines as parents or patients of data science. 


2.2 Descriptive taxonomies 


Some authors have attempted to characterise data science by providing descriptive, 
procedural taxonomies of the discipline. The analysis of three descriptive accounts 
written at different times over the last six decades offers a diachronic perspective. 

Let us begin with Tukey’s (1962) account. This appears to be the first descriptive 
taxonomy of “data analysis”, focusing on: “procedures for analysing data and tech-
niques for interpreting the results of such procedures; ways of planning the gathering of
data to make its analysis easier, more precise, or more accurate; all the machinery and 
results of (mathematical) statistics which apply when analysing data” (Tukey, 1962, 
p. 2). Tukey intended to give a transparent description of what actually occurs in the 
analysis of data. As we shall discuss in Sect. 3, the orthodox view at his time of writing 
was that data analysis was applied statistics, and hence primarily mathematical. By 
describing its nature plainly and accurately, Tukey’s account was a transgression of 
the status quo: breaking off the concept of data analysis from applied statistics into its 
own field. 

Some years after Tukey, Wu (1997) presented a threefold descriptive taxonomy
centred on data collection (experimental design, sample surveys); data modelling and
analysis; and problem understanding/solving and decision making. Like Tukey’s, this
description came as part of a broader project to move mathematical statistics in a
scientific direction. Wu proposed renaming “statistics” as “data science” or “statistical
science”, and we note the inclusion of the manifestly scientific “experimental design”.

More recently, Donoho (2017) has provided an extensive taxonomy which cites 
the University of Michigan’s “Data Science Initiative” programme: “[Data Science] 
involves the collection, management, processing, analysis, visualization, and inter- 
pretation of vast amounts of heterogeneous data associated with a diverse array of 
scientific, translational, and interdisciplinary applications” (Donoho, 2017, p. 745). 
A brief comparison with Tukey’s and Wu’s accounts highlights the maturation and 
growth of data science: the procedural pipelines have coevolved with intermediary 
stages between inputs and products. 

Earlier accounts ought not to be faulted for missing a moving target, as they may 
not have foreseen the growing demands and affordances of the digital era. However, 
we may still identify a trade-off between constraint specificity and contemporaneity in 
descriptivist accounts of data science, which has evolved along with computation. Any
account that isolates data science from its computational context risks obsolescence. 
However, any account that does not must grapple with a massively entangled and 
evolving digital sphere, with all its attendant mechanisms and requirements. There- 
fore, going forward, we distinguish ones-and-zeroes data science from pen-and-paper 
statistics by its digital and computationally intensive nature. 


2.3 Normative taxonomies 


Some researchers thought that the status quo conception of data science of their time 
was inadequate to meet the demands placed on society by the proliferation of data. This 
led them to develop revisionist accounts, often proposing normative taxonomies of data 
science. Here, we consider what are arguably the four most influential revisionary 
accounts offered so far: Chambers (1993), Breiman (2001), Cleveland (2001), and 
Donoho (2017). 

Chambers (1993) remarks that, at the time of his analysis, there was a trend in
academic statistics towards what he calls “lesser statistics” (mathematical statistics
filtered through journals, textbooks, and conferences) rather than towards engagement
with real-world applications to data. In this context, he presents the following tripartite taxonomy
(Chambers, 1993, p. 182) of the composition of his “greater statistics”, referring to 
the concept mentioned earlier as “everything related to learning from data”: 


1. Preparing data (planning, collection, organization, and validation); 
2. Analysing data (by models or other summaries); 
3. Presenting data (in written, graphical or other form). 


Chambers’ taxonomy delineates the processes and products of data science from the 
decision-making and outcomes that result from those products. The promotion of data 
preparation to stand equal to analysis and presentation is remarkably prescient. The 
subprocesses of planning, collection, organisation, and validation anticipate, respec- 
tively, the sourcing, volume, diversity, and quality of data required of practical data 
science, as opposed to the abstract concerns of “lesser statistics”. When taken together 
with the descriptions of the analysis and presentation of data, a conception is revealed 
of human limitations when confronted with data, and with data science seen as the 
epistemic endeavour to exceed those limitations. 

Breiman (2001) echoes the need for statistics to move towards the real world. 
Like some of the maximalist statements of data science analysed in Sect. 2.1, his 
account too emphasises that data analysis collaborates with, and thus acts on, specific 
disciplines, supplying them with analytical tools. To understand Breiman’s radical 
normative conception of data science as disinterested in truth, in favour of practical 
knowledge, one needs to engage in a brief historical and sociological detour. In homage 
to C. P. Snow, Breiman remarks that preference for truth or action characterises two 
contrasting “cultures” in statistics: the predictive camp, which he estimated at his time 
of writing in 2001 contained only ~2% of academic statisticians and data analysts, 
and the inference camp containing the rest. Those in the former camp are primarily 
interested in generating accurate labels on unseen data. Those in the latter focus on 
revealing mechanisms and estimating parameters. This is a distinction we shall revisit 
in Sect. 4. Breiman’s revisionism becomes manifest when he argues that the emphasis
on inference over prediction has led to a distracting obsession with “irrelevant theory”
and the drawing of “questionable conclusions”, thereby keeping statisticians “from
working on a large range of interesting current problems”. Today, the relative sizes of 
the two cultures are nearly reversed (see, for example, Anderson, 2008). Breiman’s 
vision of a theory-free data science marks a significant deviation from the classic 
epistemological project of “understanding understanding”. 

Cleveland (2001, pp. 22–23) considered the teaching programs of his time to be
deficient, producing data practitioners unprepared for the demands of an increasingly 
data-rich society. In this sociological context, he proposed the following taxonomy: 


1. Multidisciplinary investigations (data analysis in the context of different discipline-specific areas)
2. Models and methods for data (statistical models, model-building methods, estimation methods, etc.)
3. Computing with data (hardware, software, algorithms)
4. Pedagogy (curriculum planning, school/college/corporate training)
5. Tool evaluation (descriptive and revisionary analysis of tools and their methods of development)
6. Theory (foundational and theoretical problems in data science)


Cleveland’s taxonomy puts forward a conception of data science as a fundamentally 
computational, fully-fledged, scientific discipline in its own right. Four points are 
particularly relevant in this case. First, the elevation of “computing” alongside “models 
and methods for data” marks data science as fundamentally digital, separating it from 
statistics at large. In contrast to Chambers’ taxonomy, computers are by now explicitly 
recognised as the vehicle that makes data science possible. Second, under “pedagogy”, 
there is recognition of the necessity to preserve and propagate data science as an 
academic and commercial field. Third, the novel inclusion of “tool evaluation” and 
“theory”, absent in previous accounts, signals a conception of data science as self- 
reflective and progressive. Fourth, the fact that “multidisciplinary investigations” is 
placed on the same footing as the other five taxa indicates a relative deprioritisation 
of application. This is a significant shift from preceding accounts, like Breiman’s, that 
treat data science merely as a means to an end. 

More recently, Donoho (2017, p. 755) has given a comprehensive revisionary taxon-
omy to meet current needs. Emulating Chambers’ terminology, “greater data science”
is set in contrast to some of the descriptive taxonomies described in Sect. 2.2, which
he calls “lesser data science”. Greater data science consists of:


1. Data gathering, preparation and exploration
2. Data representation and transformation
3. Computing with data
4. Data modelling
5. Data visualisation and presentation
6. Science about data science


In contrast to Cleveland’s taxonomy, Donoho’s focuses just on data science qua 
field and means of enquiry. Two aspects of this taxonomy are epistemologically inter-
esting. The first, mirroring Cleveland, is the reappearance of a metascientific category
as the sixth item: data science should reflect and conduct science on itself in order to improve
and develop. Second, Donoho’s description is procedurally complete, beginning with 
data exploration and gathering, and going through all the analytical steps from origins 
to final products. This ambitious scope contributes to the normative force of Donoho’s 
proposal. 

Considering these previous points, it seems reasonable to propose the following 
working definition of data science, which we shall use in the rest of this article: 


Data science is the study of information systems (natural or artificial), by 
probabilistic reasoning (e.g., inference and prediction) implemented with com- 
putational tools (e.g., databases and algorithms). 


This definition is inclusive enough to cover all of machine learning, as well as more 
generic procedures that typically fall under the umbrella of statistics, such as scatter 
plotting to inspect trends and bootstrapping to quantify uncertainty. It may or may 
not exclude some edge cases, depending on one’s interpretation of constituent terms. 
For instance, it covers deterministic systems if one holds that these are a subset of 
probabilistic systems. It covers hand-calculated regression models if one holds that 
human cognition is a kind of computation. Yet these are grey areas, even if the former 
may be an obvious case of computer science and the latter an obvious case of statistics. 
Data science stretches across both disciplines, emphasising different aspects. 
We can now turn to the question of what kind of enquiry data science may be.


3 Kind of enquiry 


Critics may allege that data science is not an academic discipline, but a set of tools, 
bundled together through pragmatic functions. At issue is whether data are the “right 
kind of thing” to stand as the subject matter of a discipline. If “data” is a concept too 
insubstantial or the methods of data science are too heterogeneous, then any attempt 
to carve out a unified data science seems doomed to fail.¹ There is a growing demand
for data science, not just in the business world, but also in academia, as evidenced 
by a proliferation of university courses and programs, specialised teaching positions, 
dedicated conferences, journals, and industry positions. Therefore, for the sake of 
argument, one may assume that data science may be on its way to becoming an 
entrenched and mature discipline. If this is the case, the next question is of what kind. 
The literature offers three main answers: a sort of academic statistics, where statistics is 
a formal, theoretical part of applied mathematics; statistics, but appropriately expanded 
to bring it outside of applied mathematics and into a proto-science; and a full-blown 
science in itself. Let us examine each alternative in detail. 


¹ This is an open question in the philosophy of information, which we will not address here, as it is deep
enough to warrant its own dedicated investigation. We will, however, note that a sustained attempt at some
analysis of the concepts of data and information may be found in Floridi (2010).



Synthese (2022) 200:469


3.1 Data science as statistics 


The first two approaches take data science to be some form of statistics. For example, 
Donoho (2017, p. 746) provides a comprehensive collection of papers, talks, and blogs 
whose authors argue that data science simply is statistics by a different name. This 
stance further speciates according to whether one takes statistics to be part of, or 
separate from, applied mathematics. Arguing for the former case, Wu (1997) cites a 
dictionary definition of statistics: “the mathematics of the collection, organisation and 
interpretation of numerical data”. This narrow view of data analysis does not have 
many contemporary proponents. Most of the current literature either accepts that data 
analysis is part of an extended statistics—which itself is no longer seen as strictly 
formal mathematics (cf. Chambers’ greater statistics)—or grants data analysis the 
status of a standalone field, external but related to statistics, which is considered a 
narrow part of formal, applied mathematics. Breiman (2001) and Mallows (2006) take
the view that statistics is separate from applied mathematics, calling for its expansion to include scientific elements
and engage with real-world disciplines. This does not entail that statistics is itself a 
full-bodied science. Data analysis, in this view, remains statistics, even though it begins 
to transcend strictly formal, mathematical, or deductive inference and practices. 


3.2 Data science as science 


Other authors locate data analysis as a scientific discipline (with a characteristic con- 
cern for systematising the material world) rather than a mathematical one (with a 
typical concern for exploring the deductive consequences of mathematical axioms). 
Carmichael and Marron (2018, p. 120) claim that a manifestly scientific understand-
ing of data science is a “reaction to the narrow understanding of [Chambers’] lesser
statistics” [our italics]. Two main strategies support the claim that data analysis is a
science. 

The first is to formulate demarcation criteria for whatever we already call science 
(cf. Popper, 1959), and then show that data science satisfies them. Tukey (1962, p. 5)
made this attempt, setting out three paradigmatic demarcation criteria for science:
“intellectual content”, “organization into an understandable form”, and “reliance upon
the test of experience as the ultimate standard of validity”. By testing the data
science of his day against these criteria, Tukey concluded that whatever makes
other disciplines scientific also applies to data science. In a similar way, Donoho
(2017) focuses on a paradigmatic scientific feature of a subject: the formulation of
empirically accountable questions which are solved through scientifically rigorous 
techniques. Since there is conceptual room for a field of this nature that operates 
on data and information, he concludes that there is space for a forthcoming genuine 
science of data analysis. 

However, this first strategy struggles with the heterogeneity of science and has 
waned in popularity [see also Laudan’s (1983) decisive critique of Popper’s falsifica- 
tionism]. Considering this, an alternative strategy is to demonstrate relevant similarities 
between data science and paradigmatic sciences, and argue that these similarities war- 
rant an extension of the general concept. For example, Wu (1997) cites a series of 
important similarities between his descriptive taxonomy of statistics and paradigmatic
sciences. These similarities include: the “empirical–physical” approach of statistics,
in which we use induction to infer knowledge from observations and deduction to 
infer implications of theories; the primacy of experimental design and data collection; 
and the use of Bayesian reasoning to evaluate models and evidence. However, there 
are notable ways in which data science diverges from paradigmatic sciences. Such 
dissimilarities include the kind of knowledge it generates (see Sect. 4), the modes 
of logical inference by which it proceeds (see Sect. 4), and the status it endows to 
hypotheses (see Sect. 6). 

A further dissimilarity may be that data science sits alongside normal sciences, 
providing them with the tools and resources needed to make more profound, discipline- 
specific discoveries. If these dissimilarities are regarded as sufficiently significant, 
it becomes plausible that data science might not be a science at all, or may be a 
transcendental science. We turn to this topic in the next section. 


3.3 Data science as something else 


The debate above overlooks the possibility that data science may be best understood 
neither as an applied mathematical statistics nor an empirical science but as something 
else altogether. Wiggins, for example, has expressed this thought in private commu- 
nication with Donoho, claiming that “Data science is not a science... It is a form 
of engineering, and the doers in this field will define it, not the would-be scientists” 
(Donoho, 2017, p. 764). Wiggins takes the pragmatic and social ends of data science 
to distinguish it from both mathematics and empirical science, with a closer affinity to 
the essentially pragmatic interdiscipline of engineering. Perhaps a similar claim could 
be made about computer science, which is rooted in mathematics but sufficiently spe- 
cialized, interdisciplinary and practically oriented to constitute its own individuated 
field of enquiry. 

Another possibility would be understanding data science as a discipline that oper- 
ates on and assists other forms of enquiry. This was, for example, how Wittgenstein 
came to view philosophy, as a set of tools and methods for resolving linguistic confu- 
sions both within philosophy itself and in other areas of thought, like mathematics or 
psychology. A similar function might be attributed to data science, as an essentially 
operative discipline that works in tandem with various disciplines (natural science, 
social science, history and the humanities, etc.) to solve their own problems. If one 
has a general enough conception of data, one might even understand this operative rela- 
tionship as, in fact, one of data science as an external basis for other forms of enquiry. 
Data science might be conceived as serving a transcendental function for other sci- 
ences, as the condition for the possibility of their scientific inquiry as such. There 
is nothing essentially different between the structures of understanding across disci- 
plines, like Linnaeus’ taxonomies and the hierarchical ontologies familiar to database 
managers. Tycho Brahe’s journals are essentially a high-quality dataset of one kind. 
Newton’s laws of motion can be understood as an algorithm, obtained from empirical 
data, and verified against them, for predicting values for some physical variables based 
on the values of others. We shall not pursue this approach in this article, offering it as 
a suggestion to be explored (in terms of gap analysis of the literature reviewed) rather 
than a thesis to be defended. 

Whatever data science is, and irrespective of what kind of science it may be, the 
epistemological debate also focuses on the nature of knowledge it generates. This is 
the topic of the next section. 


4 The knowledge generated by data science 


This section examines the debate about the knowledge generated by data science. The 
analysis is structured into two related parts. One focuses on positions that privilege 
the process, or how (concerning modes of inference). The other gives an account of 
positions that prefer the product, or what (referring to the epistemic products). 


4.1 Modes of inference 


Different means of enquiry have differing affinities to the three typical modes of 
inference: deduction, induction, and abduction. The epistemology of data science 
reflects on the extent to which data scientists deploy these various modes. 

Deductive inferences are present in data science through the heavy reliance on 
mathematical and logical reasoning. Probability theory, differential calculus, func- 
tional analysis, and theoretical computer science are all purely deductive disciplines 
widely used to derive the properties of algorithms and design new learning procedures 
with little concern for empirical behaviour. For instance, the backpropagation algo- 
rithm used to optimise parameters in neural networks combines elements of linear 
algebra and multivariable calculus to converge, provably, on a local optimum of an 
objective function. No datasets are required to derive this result. 

Inductive inferences are also of central importance. Any data are a finite sample 
of the world. Data science then identifies structures in the data and distils them into 
information that applies beyond the data itself. This is achieved by projecting the 
patterns and structures found in data to new contexts, going beyond the antecedent 
domain. This projection is an inductive inference. Harman and Kulkarni (2007) argue 
that statistical learning theory represents a principled and sophisticated defence of 
induction. Similar remarks can be found in Frické (2015), who observes that “Inductive 
algorithms are a central plank of the Big Data venture.” More recently, Schurz (2019) 
has argued that formal results from reinforcement learning demonstrate the optimality 
of meta-induction, thereby solving Hume’s problem on a priori grounds. In other 
words, this represents a defeasible solution to Hume’s problem of induction, whereby 
statistical testing can provide stronger or weaker evidence in favour of particular 
hypotheses (Mayo, 1996, 2018). 

One can distinguish between two canonical types of inductive inference, object and 
rule induction. The first is the informed prediction of singular unobserved instances: 
hypotheses of the form “the next observed instance of X will be Y” based on previous 
data of the co-instantiation of X’s and Y’s. This is known as object induction. Rule 
induction, by contrast, posits universal claims of the form “all X’s are Y’s”, based on 
the same data. Data-scientific investigation involves both. Singular predictive instances 
are commonplace in any application of supervised learning, where the goal is to learn 
a function from inputs to outputs. These are the kinds of inductions that interest 
Breiman’s (2001) “first culture” of statistics. At the same time, one of the purposes of 
data science is to identify underlying structures and mechanisms. The project of causal 
inference, which we revisit in Sect. 4.2, is devoted to such forms of rule induction. 

Turning to abductive inference, Alemany Oliver and Vayre (2015) have empha- 
sised the importance of abductive reasoning in data science methods, particularly in 
how data science is embedded into broader scientific practice (see Sect. 6 for further 
discussion). They argue that the tools of data science are useful first in exploring data 
to determine its internal structure, and second in identifying the best hypotheses to 
explain this structure. This inference from structure to an explanatory hypothesis is an 
abductive inference. The view that science is essentially abductive can be traced back 
to Peirce, though modern adherents abound (Harman, 1965; Lipton, 1991; Niinilu- 
oto, 2018). The status of abduction in a data-intensive context is further elevated by 
the theoretical virtue of explanatory unification (Kitcher, 1989). In the philosophy of 
science, a common virtue of a theory is its explanatory power, with some authors 
maintaining that such power is grounds to choose one of two empirically equivalent 
theories (cf. van Fraassen’s (1980) discussion of pragmatic virtues). One dimension of 
explanatory power is the extent of the diversity and heterogeneity of phenomena that 
a theory can explain simultaneously (cf. Kitcher, 1976). If the methods of data science 
make possible the identification of patterns in a diverse and heterogeneous range of 
phenomena, then perhaps we will develop a broader and more nuanced picture of the 
explanatory power of our theories. For those theories that can unify many phenomena, 
abductive reasoning confers more robust support on them in light of the patterns 
revealed by various data science techniques. 

In addition to being an end in itself, epistemological reflection on modes of inference 
also sheds light on the connections between data science, mathematics, and science. 
The similarities between these disciplines—such as their relevance (Floridi, 2008), 
explanatory power, practical utility, and degree of success—are precisely what is in 
question when we look to extend the categories coherently. For example, mathemat- 
ical proofs are formulated deductively. But given the importance of non-deductive 
inferences in data science, one needs to recognise an important difference between 
the two and refrain from placing data science strictly within applied mathematics. 
Likewise, natural sciences use a mixture of deduction, induction, and abduction in 
everyday practice, with more formal sciences making more frequent use of deduction, 
and more applied sciences relying more on abduction. Other sciences assign different 
weightings to differing modes of inference. For example, abduction is commonplace in 
the social, political, and economic sciences. Cognitive science is another example that 
relies on abduction, given the frequency of empirically equivalent, underdetermined 
theories. It seems that data science, if it is a science, is in good company. 


4.2 Epistemic products 


The trichotomy of machine learning—which spans supervised, unsupervised, and 
reinforcement learning algorithms—helps delineate the kind of knowledge generated 
by data science and its techniques. 

Supervised learning models predict outcomes based on observed associations. They 
automate the process of inductive reasoning at scales and resolutions that far exceed 
the capacity of humans. However, large datasets and powerful algorithms are insuffi- 
cient to overcome the fundamental challenges inherent to this mode of inference. A 
model that does well in one environment may fail badly in another, if data no longer 
conform to the observed patterns. For instance, a classifier trained to distinguish cows 
from camels may struggle when presented with a cow in the desert or a camel on 
grass, presuming the training set only contains images of both animals in their natural 
habitats. Since the background was a reliable indicator of the outcome in training, 
the model could be forgiven for assuming the same would hold at test time. More 
dramatic examples have emerged in the context of adversarial learning, where models 
are trained with the express goal of fooling another model into misclassifying images 
(Goodfellow et al., 2014). In this case too, the adversarial example succeeds in its 
deception precisely because it comes from a different data generating process than 
the original data (although this may not be clear to the naked eye). Such fallibility 
is inherent in all inductive reasoning, which nevertheless helps us accomplish many 
important epistemic goals. 
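
As an illustration only, the cow and camel scenario can be sketched in a few lines of Python. The sketch is not taken from the literature cited above. The two binary features, the 80% reliability of the "core" feature, and the single-feature learner are all assumptions chosen to keep the example small. In training, the background feature always agrees with the label; at test time it is unrelated to the label.

```python
import random

def make_data(n, background_matches_label, rng):
    """Each sample is (core, background, label); all values are 0 or 1.

    core: a genuine but noisy signal (assumed to agree with the label 80% of the time).
    background: always agrees with the label in training, but is pure noise
    once the environment shifts (e.g. a cow photographed in the desert).
    """
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        core = label if rng.random() < 0.8 else 1 - label
        background = label if background_matches_label else rng.randrange(2)
        data.append((core, background, label))
    return data

def accuracy(data, feature_index):
    """Accuracy of the rule 'predict the label to equal the chosen feature'."""
    return sum(row[feature_index] == row[2] for row in data) / len(data)

def fit_stump(data):
    """Hypothetical learner: keep the single feature that predicts best in training."""
    return max((0, 1), key=lambda i: accuracy(data, i))

rng = random.Random(0)
train = make_data(2000, background_matches_label=True, rng=rng)
test = make_data(2000, background_matches_label=False, rng=rng)

chosen = fit_stump(train)            # index 1, the background feature
train_acc = accuracy(train, chosen)  # perfect in the training environment
test_acc = accuracy(test, chosen)    # close to chance after the shift
```

The learner prefers the background because nothing in the training data distinguishes a reliable indicator from a merely coincident one. The noisier core feature would have generalised better, which is the inductive risk described above.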

Unsupervised learning is a more heterogeneous set of methods, broadly united 
by their tendency to infer structure without predefined outcome variable(s). Examples 
include clustering algorithms, autoencoders, and generative models. At their best, these 
tools can shed light on latent properties—how samples or features reflect underlying 
facts about some natural or social system. For instance, cancer research commonly 
uses clustering methods to categorise patients into subgroups based on biomarkers. 
The idea is that an essential fact (e.g., that cancer manifests in identifiable subtypes) 
is reflected by some contingent property (e.g., gene expression levels). The risk of 
overfitting—i.e., “discovering” some structure in training data that does not generalise 
to test data—is especially high in this setting, as there is no outcome variable against 
which to evaluate results. 

In reinforcement learning, one or more agent(s) must navigate an environment with 
little guidance beyond a potentially intermittent reward signal. The goal is to infer 
a policy that maximises rewards and/or minimises costs. A good example of this is 
the multi-armed bandit problem. An agent must choose among a predefined set of 
possible actions—i.e., must “pull” some “arm” —without knowing the rewards or 
penalties associated with each. Therefore, an agent in this setting must strike a bal- 
ance between exploration (randomly pulling new arms) and exploitation (continually 
pulling the arm with the highest reward thus far). Reinforcement learning has pow- 
ered some of the most notable achievements of data science in recent years, such 
as AlphaGo and its successor AlphaZero, which has achieved superhuman play at Go, chess, 
and shogi. The epistemic product of such algorithms is neither 
associations (as in supervised learning) nor structures (as in unsupervised learning), 
but policies—methods for making decisions under uncertainty. 
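
The trade-off between exploration and exploitation can be made concrete with a minimal sketch of an epsilon-greedy agent, one simple and well-known bandit strategy among many. The reward probabilities, the exploration rate of 0.1, and the function name are assumptions made for illustration; they are not drawn from the works discussed here.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a Bernoulli multi-armed bandit.

    true_means: success probability of each arm, hidden from the agent.
    Returns the agent's estimated mean reward per arm and the pull counts.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm has been pulled
    estimates = [0.0] * n_arms   # running average reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: pull an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploitation: pull the arm with the highest estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts

# Assumed reward probabilities for three arms (illustrative values only).
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(counts)), key=lambda a: counts[a])  # most frequently pulled arm
```

With these assumed probabilities one would expect the third arm to dominate the pull counts. What the agent learns is the rule it follows when choosing an arm, a policy in the sense used above, with the per-arm estimates serving only as inputs to that rule.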

On their own, these methods do not necessarily provide causal knowledge. How- 
ever, some of the most important research on AI of the last 20 years has focused on 
causal reasoning (Imbens & Rubin, 2015; Pearl, 2009; Peters et al., 2017; Spirtes 
et al., 2000). Such research demonstrates how probabilistic assumptions can combine 
with observational and/or interventional data to infer causal structure and treatment 
effects. Remarkably, this literature is only just beginning to gain traction in the machine 
learning community. Recent work in supervised learning has shown how causal prin- 
ciples can improve out-of-distribution performance (Arjovsky et al., 2019), while 
complex algorithms such as neural networks and gradient boosted forests are increas- 
ingly used to infer treatment effects in a wide range of settings (Chernozhukov et al., 
2018; Künzel et al., 2019). The task of learning causal structure from observational 
data is a quintessential unsupervised learning problem. This has been an active area 
of research since at least the 1990s and remains so today [see Glymour et al. (2019) 
for a recent review]. Reinforcement learning—perhaps the most obviously causal of 
all three branches, given its reliance on interventions—has been the subject of intense 
research in the last few years (Bareinboim et al., 2021). Various authors have shown 
how causal information can improve the performance of these algorithms, which in 
turn helps reveal causal structure. 

These methods can, in principle, be used to infer natural laws. Schmidt and Lipson 
(2009) have proposed what appears to be the algorithmically obtained laws of classi- 
cal mechanics. Their method involved analysing the motion-data of various dynamical 
systems using algorithms without prior physical knowledge of mechanics. They claim 
to obtain the Lagrangian and Hamiltonian of those dynamical systems, together with 
various conservation laws. This is an important result for those hoping for the pos- 
sibility of the autonomous discovery of natural laws. More recent work on symbolic 
metamodels provides a more general strategy for deriving interpretable equations from 
complex machine learning models (Alaa & van der Schaar, 2019). We shall discuss the 
roles of correlation and causation in science, and of autonomous, theory-free science 
in Sect. 6. Leonelli (2020, Sect. 7) also considers the nature of the epistemic products 
of data science (the extent to which they are predictions or knowledge of causations) 
and their consequences on scientific investigation. 


5 Black box problems 


Tools that produce more successful (more efficient, accurate, deployable, etc.) out- 
comes are adopted in virtue of their utility, often without reflection on how we are to 
understand them and the mechanisms by which they work. This has led to questions 
about the opacity of these tools, where opacity is here understood as our inability to 
find intelligible mechanisms through which some outcome is achieved. Opaque tools 
have come to be known as “black boxes” due to their lack of epistemic transparency. 
In this section, we consider and evaluate the epistemological debate in the literature 
on data science about a variety of black box problems. 

It may be helpful to begin with some clarification. Burrell (2016) has proposed that 
there are three ways in which data science algorithms become opaque. The first is their 
intentional concealment for commercial or personal gain. The second is the opacity 
that arises from technological literacy and proficiency being a necessary condition 
to understand sophisticated algorithms. And the third is the inherent complexity that 
arises from algorithmic optimisation procedures that exceed the capacity of human 
cognition. The first two of these problems are pragmatic problems that occur when 
data science is embedded in wider society [see Tsamados et al. (2020) for recent work 
on these issues]. They are not the kind of in-principle epistemological problems that 
concern us here. Thus, we will focus only on the last problem. There have been many 
“technical solutions”, or proto-solutions, to various instances of black box problems 
which proceed through fine-grained mathematical investigations, and do not attempt 
to integrate black boxes into the ordinary understanding. Again, we are not concerned 
with these here, because we are focused on the level of abstraction of concern to 
philosophy, which is above such technical investigations. In this section, we provide 
only a brief, comparative overview to illustrate (dis)similarities, or instances where 
putatively different problems may collapse into one. 

Black box problems can be placed into two broad categories, which we shall organ- 
ise as conceptual and non-conceptual. Conceptual problems concern whether and 
when the concepts belonging to our ordinary understanding can be employed sensi- 
bly and intelligibly in discussions of black boxes and their workings. For example, 
a conceptual problem is whether the term “explainability” may be coherently and 
unambiguously defined in a machine learning context. Non-conceptual problems, in 
contrast, do not concern the appropriate use of ordinary concepts in machine learn- 
ing contexts, but the broader problems that result from these concepts. Within these 
non-conceptual problems, we will restrict our focus to those in the domain of episte- 
mology. However, it is worth recognising that further non-conceptual problems arise 
elsewhere, for example, in ethics or politics. 


5.1 Conceptual problems 


Some black box problems arise from our ordinary concepts being inadequate or unclear 
when projected into machine learning contexts. Lipton (2018) has acknowledged this 
imprecision over the use of “interpretation”. He observes that “the task of interpreta- 
tion appears underspecified. Papers provide diverse and sometimes non-overlapping 
motivations for interpretability and offer myriad notions of what attributes render mod- 
els interpretable” (Lipton, 2018, p. 36). Similarly, Doshi-Velez and Kim (2017) have 
remarked on the lack of agreement on a definition of “interpretability”, and further 
about how it is to be evaluated. They identify two paradigmatic uses of “interpretabili- 
ty” in the literature: interpretability in the context of an application and interpretability 
through a quantitative proxy. Rigorous definitions of both are found lacking. 

There have been a few attempts to respond to such conceptual problems. One 
important first step is to construct a clear taxonomy of how problematic concepts 
like interpretability are used, and what the desiderata and methodologies of inter- 
pretability research are. This is the kind of project in which Lipton (2018) engages. 
A second attempt has been to refine these concepts, or at least conduct the groundwork 
to facilitate this refinement. Doshi-Velez and Kim (2017) engage in this kind of 
project, laying the groundwork for the subsequent rigorous definition and evaluation 
of interpretability. Authors have also refined the concepts of interpretability and its 
cognates by making fine-grained distinctions within them, adding to their structure. 
Doshi-Velez and Kim (2017) distinguish between local and global interpretability to 
avoid confusion. The former applies to individual predictions and the latter to the 
entire decision boundary or regression surface. Watson and Floridi (2020) make a 
similar distinction between local (token) and global (type) explanations, though in a 
more formal mathematical context. 

Further work on the representations deployed in black box problems concerns the 
relationship between various roughly synonymous terms: words like “interpretability”, 
“explainability”, “understandability”, “opacity”, and so on. It is of philosophical inter- 
est whether any or all of these terms overlap in whole or part. Some commentators take 
a coarse-grained approach to such cognates. Krishnan (2020), for example, takes them 
to be negligibly different, arguing that these terms all define one another in a circular 
fashion that does little to clarify imprecise concepts. Others take a more fine-grained 
approach. Tsamados et al. (2020) emphasise the difference between explainability 
and interpretability. The former applies to experts and non-experts alike: for example, 
an expert data science practitioner might need to explain the mechanics of some 
algorithm to their non-expert client. In contrast, the latter is restricted to experts 
(interpretability as interpretability-in-principle). Thus, in their view, explainability 
presupposes interpretability but not vice versa. 


5.2 Non-conceptual problems 


Non-conceptual problems and their solutions do not address deficiencies in represen- 
tations themselves. In this section, we will discuss four epistemological problems that 
have received less attention. 

Ratti and López-Rubio (2018) have argued that interpretability is crucial to distil 
causal explanations from the correlations identified by data science techniques, as 
may be the case in a data-rich scientific context. Using the paradigm of mechanistic 
biological models, they observe that for biologists to turn data-scientific correlative 
models into causal models with explanatory power, the correlative models must be 
interpretable. This stems from a general trade-off: the more complex a model is, the 
less explanatory it becomes. Since the predictive powers of data-scientific models 
are positively correlated with their complexity, they conclude that there is a genuine 
epistemological black box problem. In Sect. 6, we shall see that this epistemological 
concern is more significant when considering the nature of the scientific method in 
a data-driven paradigm. Insofar as we expect computational, data-led investigations 
to become increasingly important for scientific progress, the algorithms and tools we 
deploy in these investigations ought to yield discoveries accessible to scientists and 
integrate readily with the wider scientific epistemic community. Thus, we agree with 
Ratti and López-Rubio (2018) on the need to address such opacity problems, given 
the importance of a secure foundation for a data-driven science. 

Watson and Floridi (2020) have construed overfitting as a different kind of epistemological 
black box problem, as a kind of algorithmic Gettier case. Overfitting occurs 
when a machine learning model performs well in training but fails at test time. As 
an example, they cite the results of Lapuschkin et al. (2016), in which pictures of a 
horse shared a subtle, distinctive watermark. The resultant image classifier strongly 
associated that watermark with the label “horse”, and thus failed to classify horses 
in a test set in which the watermark was absent. This, Watson and Floridi propose, is 
importantly analogous to the Gettier cases of classical epistemology, which illustrate 
how epistemic luck can pull apart notions of knowledge and justified true belief. The 
model is justified to infer that watermarks entail horses, since the former was both 
necessary and sufficient evidence for the label in training. Moreover, when presented 
with a new image of a horse containing the tell-tale watermark, the prediction is true. 
Finally, we may come to believe in the model due to its high accuracy rate, at least 
on data from this training regime. However, the apparent accuracy is a mirage. The 
overfit model is right for the wrong reasons, and hence only accidentally true. Watson 
and Floridi argue that better tools for model interpretability can reveal such errors, 
mitigating the potential damage of overfitting. 

We recognise some notable differences with classical Gettier cases, though not ones 
that undermine the epistemic consequences of opaque algorithms. One such difference 
is that, due to the very threat of overfitting, we might avoid attributing genuine relia- 
bility to some machine learning algorithm in virtue of only its training performance. 
Rather, it remains possible that such algorithms are only deemed reliable when they 
have performed sufficiently well in non-training environments, so that overfitting is 
demonstrably and empirically insignificant. Differences aside, there appears to be a 
significant epistemological problem regarding accidentally true classifications that 
may be met through developments of machine interpretability. 

Krishnan (2020) has remarked on the broader epistemological point that, insofar 
as machine learning algorithms might have a pedagogical dimension (that we can 
learn from the mistakes that algorithms might make), they must be interpretable or 
understandable for us to learn anything at all. Lipton (2018, Sect. 2.4) (citing Kim 
et al., 2015) makes a similar remark on the informativeness of algorithms. Thus, there 
are significant epistemic benefits to greater algorithmic transparency. 

The discussion above gives the impression that these problems are substantial and 
worth solving. However, not all commentators agree. There are two main kinds of 
objections. Some concede that black boxes are opaque but deny that the correct way 
to proceed is to try to explain or interpret their inner workings. Instead, they argue 
that black boxes should be replaced altogether by equally capable non-black boxes 
(however, this strategy must answer tough questions of attainability). Others deny that 
black boxes are problematic at all. Let us look at each position in turn. 

Rudin (2019) has expressed an objection of the first kind. She agrees that the lack of 
interpretability of machine learning algorithms is a problem. However, she takes this 
not as motivation to construct better post hoc interpretability methods, but instead as 
a reason to reject opaque models altogether. She rejects the commonplace assumption 
that accuracy and interpretability are inversely related. In her view, black box problems 
should be dissolved (rather than solved) by globally transparent models that perform 
comparably to black box competitors. Rudin’s solution can be criticised for assuming 
that it is realistic and practical to expect comparable alternatives to black boxes. One of 
the reasons black box problems exist is that the top-performing models for many 
complex tasks are opaque. It is a chief difficulty of the problem that, in many cases, 
alternatives are either non-existent or impractical. 

Zerilli et al. (2019) have expressed an objection of the second kind, arguing that 
the opacity of black boxes is not a genuine problem. They see the explainability 
debate as evidence of a pernicious double standard. They point out that we do not 
demand explicit, transparent explanations from human judges, doctors, managers, 
military generals, or bankers. Rather, justification is found simply in past reliability: 
demonstrated and sustained accuracy and success. If we impose the same norms on 
algorithms, then the explainability problem is dissolved. 

Similarly, Krishnan (2020) has argued that our concerns about interpretability and 
its cognates are unnecessarily inflated. The inherent imprecision of these terms pre- 
vents them from doing the work required of them: “Interpretability and its cognates 
are unclear notions... We do not yet have a grasp of what concept(s) any technical 
definitions are supposed to capture—or indeed, whether there is any concept of inter- 
pretation or interpretability to be technically captured at all” (Krishnan, 2020, p. 490). 
But unlike Doshi-Velez and Kim, Krishnan does not take this as motivation to sharpen 
such concepts for subsequent progress, for worrying about them distracts from our 
real needs. Krishnan contends that most of the de facto motivations for treating inter- 
pretability as an epistemological problem in the first place are due to other ends (e.g., 
social, political, etc.). For example, algorithmic bias audits use explainability as a 
means to avoid unethical consequences. 

We are sympathetic to Krishnan’s overall project. Many authors uncritically assume 
that black box problems are necessarily important, and epistemological concerns about 
concepts like interpretability are indeed often means to other ends. However, we dis- 
agree that this observation undermines the status of purely epistemological concerns, 
as the examples from Sect. 5.2 attest. It might be the case that worrying about black 
box problems is an inefficient and suboptimal use of philosophical effort (particularly 
in the hyper-pragmatic context in which data science methods are mostly deployed). 
However, black box problems qua objects of epistemological interest remain relevant 
to at least some parts of a complete philosophy of data science. 


6 Normal science in a data-intensive paradigm 


So far, we have considered foundational and epistemological issues in the philosophy 
of data science. We may now broaden the investigation to consider how data science 
might relate to science and the philosophy of science more generally. 

The classical conception of the scientific method involves a specific gnostic relation 
between hypotheses and evidence. Hypotheses are derived from existing scientific 
theory (e.g. Peter Higgs’s 1960s prediction of the Higgs Boson) and then confirmed or 
falsified through experiments (e.g. the CERN confirmation of the Higgs Boson). The 
relation is gnostic because scientists within the paradigm posit specific connections 
between phenomena in advance of experiments. 

However, it has recently been proposed that the proliferation of data has inaugurated 
a new era of agnostic science. Here, scientific knowledge can be generated, 
and mathematical and data-scientific methods deployed without prior knowledge or 
understanding of phenomena or their interrelations. A putatively agnostic science is 
one where experiments are in some sense “blindly” performed, and large amounts 
of data amassed. Then algorithms retrospectively seek correlations in this data, from 
which underlying causal laws and scientific generalisations can be extracted. Kitchin 
(2014) has compiled Gray’s work (found in Hey et al. (2009)) to elucidate the nature 
of this new paradigm and locate it in the history of science. This section explores the 
extent and implications of agnostic science. 


Paradigm   Nature          Form                                                      When

First      Experimental    Empiricism; describing natural phenomena                  pre-Renaissance
Second     Theoretical     Modelling and generalization                              pre-computers
Third      Computational   Simulation of complex phenomena                           pre-Big Data
Fourth     Exploratory     Data-intensive; statistical exploration and data mining   Now

Scientific paradigms taken from Kitchin (2014, p. 3), compiled from Hey et al. (2009) 


6.1 Agnosticism about the application of mathematics 


One identification of agnosticism is provided by Napoletani et al. (2018), who observe 
that the de facto application of mathematical techniques in science is undergoing an 
agnostic transformation. They remark that classical methods required both the prior 
understanding of phenomena and interconnections between elements in datasets. This 
is the case, for example, if one wishes to model some biological population using 
differential equations. The nature of the models one uses, which parameters to include, 
and so on, require the scientist to have antecedent knowledge and understanding of 
population biology, multivariate calculus, etc. They also need the scientist to know 
the basic structure of the dataset. Matters are very different in contemporary data 
analysis. There, the scientist can remain, to a great extent, agnostic or uninformed 
about any underlying scientific theory and the structure of their data. With the tools 
of contemporary data science, raw data can be parsed, and structure exploited more 
or less automatically. 

After observing that this appears to be an important direction in scientific prac- 
tice, Napoletani et al. raise the second-order question of why mathematics and data 
have such an effective synergy. They claim that a common response is to appeal to 
a Wignerian-like resignation to “unreasonable effectiveness” (Wigner, 1960). In this 
view, big data has a sort of omnipotence that grants unreasonable success to disparate 
and heterogeneous data-scientific tools. However, Napoletani et al. (2018) reject this 
response, arguing that the question can be reformulated into the more general question 
of whether the success of mathematical methods in an agnostic normal science is due to 
a similarity between the structure of those methods and the structure of the phenomena 
themselves captured in data corpora. This is a question that deserves further attention 
in the debate. While Napoletani et al. observe the increasing possibility of employing 
mathematical techniques agnostically, others have engaged in a more radical debate 
about whether this agnosticism heralds the end of theory choice in science altogether. 
This is the topic of the next section. 


6.2 Theory-free science 


Anderson (2008) has argued that classical theory-driven science is becoming obsolete. 
In his view, the density and plurality of correlations yielded by the analysis of extraordinarily
large amounts of data will become more useful than the causal generalisations
provided by classical science. [Such views are also discussed in: Cukier and
Mayer-Schoenberger (2013); Leonelli (2020); Prensky (2009); Steadman (2013).] Kitchin
(2014) provides a more formal characterisation of this view, which he calls a new 
type of empiricism, and Schmidt and Lipson’s (2009) aforementioned reconstruction 
of classical mechanics via machine learning is a provocative example of theory-free 
science in action. 

Critics object that this is sensationalist, over-optimistic and inflated. Kitchin (2014) 
presents a fourfold attack on Schmidt and Lipson’s (2009) analysis. His first contention 
is that, as much as large data corpora can try to exhaust information in a whole 
domain, they are nonetheless coloured by the technology used in their generation and 
manipulation, the data ontology in which they exist, and the possibility of sampling 
bias. Indeed, “all data provide oligoptic views of the world” (Kitchin, 2014). Second, 
following Leonelli (2014), he remarks that even the agnostic distillation of structure 
and patterns from data cannot occur in vacuo from all scientific theory. Due to their deep 
embedding in society, scientific theories and training always provide the scaffolding 
around data collection and analysis. Third, insofar as normal science is cumulative, he 
argues that the individual results of data-scientific investigations will always require 
interpretation and framing by scientists equipped with knowledge of scientific theories. 
And fourth, if data and the results of its analysis are interpreted free of any background 
theory, they risk becoming fruitless. It will be difficult for them to contribute to any 
fundamental understanding of the nature of phenomena since it “lacks embedding in 
wider ... knowledge” (Kitchin, 2014). Frické (2015) presents a similar view against 
this extreme kind of agnosticism. He objects that one needs antecedent theoretical 
insight to decide which data to provide inductive algorithms in the first place. Theory 
cannot be removed from science, even in a data-driven paradigm. 

Kitchin’s first point is echoed by Symons and Alvarado (2019), whose epistemological
analysis of data and computer simulations expresses doubt that experimental simulations
can ever be epistemically justified purely by the pragmatic success of their
correlative predictions. The authors argue that in addition to these pragmatic successes, 
data-driven scientific investigation only becomes credible if it is entrenched in a variety 
of other factors, including well-curated data, established scientific theory, empirical 
evidence, and good engineering practices” (Symons & Alvarado, 2019, p. 0.3). Their
view therefore underscores the mutually reinforcing relationship between theory and 
data in science, opposing a data-dominated conception. It also reiterates the significance,
apparent in Sect. 4 and elsewhere, of high-quality and diverse data sets.
Elsewhere, Symons and Alvarado (2016, p. 3) further attack the notion of a theory-free
science by connecting it to problems of epistemic opacity. They examine the Google Flu
Trends web service and observe a series of limitations within the resultant data set. 
These limitations, they contend, have at best only inadequately ambiguous sources; 
hence they question the possibility of a genuinely agnostic science, and further under- 
score the epistemological significance of black-box problems. 

We believe that these arguments can be supplemented with further reasons against 
this total agnosticism. The first reason relates to the critical issue in the philosophy of 
science of the theory-ladenness of observation, which holds that what one observes 
is influenced by one’s theoretical and pre-theoretical commitments (see also Leonelli, 
2020, Sect. 2). This is especially true for data science, where observations are gathered, 
labelled, and processed according to pre-existing categories and analysis routines. For 
any data to be manipulated and ultimately rendered intelligible to a human being, they 
must be represented under one particular concept or another. And in this process of 
conceptual subsumption, one’s theoretical commitments will always prevent any kind 
of total agnosticism. Second, it is plausible that Anderson’s claim that correlations 
will be sufficient for the future of science is too naïve a conception of the scientific 
enterprise. It reminds one of Bacon’s untenable view that Nature would speak by itself 
if adequately interrogated. Agnostic data science may generate a predictive science 
without knowledge of underlying natural laws or causal mechanisms, but prediction 
is not the only goal of the scientific enterprise. Explaining phenomena by knowing the 
underlying causal structure of the world, and helping to plan and intervene, are also 
two important goals of science. A similar line of argument is offered by Cukier and 
Mayer-Schoenberger (2013, p. 32). Third, and as perhaps Kuhn (1970) showed better 
than anyone else, science is still a fundamentally social project. Scientific paradigms do 
not exist in a vacuum. Science is developed by human experts embedded in a rich and 
complex sociocultural environment. Discourse about science, and scientific pedagogy, 
are indispensable aspects of what science is. Consequently, one might question whether 
a genuinely agnostic science would be recognisable as a science at all. Such a science 
would have a significantly different intellectual structure to contemporary versions of 
the discipline, and it is not clear whether such a difference could be accommodated. 

Total agnosticism, therefore, seems too extreme. The task then is how to integrate 
agnostic data-scientific practices into scientific methodology. Kitchin (2014) proposes 
a humbler account of this integration. He calls it “data-driven science”, which takes
the form of a rebalancing of the three modes of inference discussed in Sect. 4.1. He 
argues that contemporary normal science has an experimental-deductive dimension in 
which hypotheses are deduced from more fundamental hypotheses and then offered 
up for confirmation or refutation by experiment. In contrast, science in a data-driven 
paradigm elevates the status of inductive logic in this process of hypothesis formation, 
with experimental hypotheses generated from correlations identified by data-scientific 
methods rather than by deduction from parent hypotheses. However, in contrast to the 
naïve empiricist, Kitchin’s data-driven science does not involve the absolute primacy of 
induction. Theories and their deductions play an essential role, for example, in framing
data, directing which data-scientific processes to deploy, embedding results in wider 
knowledge, generating causal explanations, and so on. A picture of a new science 
emerges, involving a shift towards a more inductive enterprise, while maintaining 
many paradigmatic and realistic similarities to our current model of normal science. 

There have been further remarks about the introduction of data-scientific methods 
into the social sciences. Lazer et al. (2014) stress the emergence of “computational 
social science”, and Miller (2010) observes the proliferation of data in the context 
of regional and urban science. In both cases, the potential for data to reshape social- 
scientific practices is acknowledged. However, authors have noted the dissimilarities 
between natural and social sciences, which likely mean that the impact of data on the 
two categories will differ. 

It is likely that the future of data-intensive science will still be theory-based, though
agnostic, data-scientific methods will sometimes be used to assist in theory generation.
Since Reichenbach (1938), there has been a popular distinction made in the
philosophy of science between the contexts of discovery and justification: where a 
theory came from is irrelevant to whether the theory is sound. Consequently, it has 
become orthodox to consider scientific theories only for their own content, independent 
of their origins. The genealogy of our scientific knowledge has, classically, never been 
of epistemic relevance. 

This distinction may be questioned by the possibility of agnostic science in a 
data-intensive paradigm. For the genealogy of agnostic knowledge that
is generated autonomously from data is now important: its epistemic standing is supervenient
on the tools and algorithms of data science that generated it and on the quality
of the antecedent data. Thus, the reliability of automated inferences depends on the 
quality of the underlying data and the algorithm(s) used to extract information from 
them. Such questions about theory genealogy are perhaps too often ignored by modern 
philosophies of science that inform “gnostic” paradigms. A philosophy of science in 
a data-intensive paradigm may be forced to address them more directly. 

In Sect. 5, we highlighted the relationship between algorithmic opacity and agnos- 
tic aspects of science. There, we discussed how opaque, uninterpretable algorithms 
may prevent underlying causal connections between phenomena from being inferred 
from correlations in data. If algorithms become responsible on a wide scale for recognising
correlations in data, their interpretability becomes essential to understanding the
explanatory grounds of those correlations. Thus a theory-free scientific paradigm, or
a paradigm in which algorithms start to play a more autonomous role in the scientific
method, ought to concern itself with developing frameworks for addressing black-box
problems and their import.


7 Conclusion 


In this article, we have provided a systematic and integrated review of the current landscape
of the epistemology of data science. We have focused on critically evaluating it
and on identifying and characterising some of its pressing or obvious gaps, wherein
philosophical interest lies. We have structured this reconstruction into five areas: descriptive
and normative accounts of the composition of data science; reflections upon the kind
of enquiry that data science is; the nature and genealogy of the knowledge that data 
science produces; “black box” problems; and the nature and standing of a new frontier 
within the philosophy of science that is raised by data science. Each of these areas is 
home to a variety of important issues and active debates, and each area interacts with 
the others. The resulting picture is a rich, interconnected, and flourishing epistemol- 
ogy, which will continue to expand as both philosophical and technological progress 
is made, possibly influencing other interconnected views about the nature of science 
and its foundations. 


Appendix: Details of literature search


This analysis of the debate about the epistemology of data science was conducted 
through a systematic and integrated review of the relevant literature (Grant & Booth, 
2009). This type of review entails a comprehensive search process, allowing the incorporation
of multiple study types (Grant & Booth, 2009). This is suitable for this case,
where the reviewed scholarship included documents from various disciplines
and of different study types. The relevant documents were collected primarily from
three top research databases: Google Scholar, PhilPapers, and Scopus. Further references
were added by following the literature cited in the selected papers. The literature was 
restricted exclusively to documents written in English. The impact of this choice on 
the analysis is likely minimal as we do not expect a considerable geographic, cultural, 
or linguistic variation in foundational questions in data science, given the discipline’s 
global, highly interdisciplinary, and contemporary nature. This is, perhaps, unlike 
other philosophical issues (e.g., ethical studies), which may be more sensitive to such 
variation in genealogy or circumstance. 

The literature search for the present work was conducted as described by the fol- 
lowing diagram: 


469 Page 24 of 27 Synthese (2022) 200:469 


Stage 1: Preliminary exploratory reading

The consultation of internal Oxford Internet Institute articles and
3 relevant review publications was undertaken to establish the
scope of this paper.


Stage 2: Structured database search

3 databases (PhilPapers, Scopus and Google Scholar) were
searched using the query scheme in Figure 3. This yielded
~500 papers given appropriate truncations.


Stage 3: Screening

Papers were first screened by title (yielding ~100 relevant results)
and then by abstract. Duplicates and papers exclusively concerning
the ethical, political and legal aspects of data science (for reasons
explained in the paper body) were then discarded. This yielded
~50 papers.


Stage 4: Analysis + bibliography search

Papers, or their relevant sections, were analysed and key concepts
extracted.

~40 relevant papers from bibliography inspections were added to
the literature pool, yielding a total of ~90 papers which were used
in the final analysis.


Scheme of literature search


Database                  Search query

PhilPapers                “data science”
                          “data science” & “epistemology”
                          “big data”

Scopus, Google Scholar    “data science” & “epistemology”
                          “data science” & “philosophy”
                          “big data” & “epistemology”
                          “big data” & “philosophy”


Table of search queries